Dataset Attributes:

Importing Necessary Libraries

Load and Explore the Data

Feature Engineering:

Fixing Data Types

Summary of Numerical Columns

Processing Columns

Missing Values:

Exploratory Data Analysis:

Univariate Analysis - Numerical Columns:

Observations:

Insights:

Univariate Analysis - Categorical Columns:

Observations:

Correlation Matrix

Observations:

Observations:

Bivariate Analysis

Observations:

Multivariate Analysis

Observations:

Data Pre-Processing:

Outliers Treatment:
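As a minimal sketch of one common treatment (IQR-based capping, on made-up numbers rather than the project's data; the notebook's actual method may differ):

```python
import pandas as pd

# Toy series standing in for a numerical column; 95 is an obvious outlier
s = pd.Series([10, 12, 11, 13, 12, 95])

# Cap values outside Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
capped = s.clip(lower, upper)
print(capped.tolist())  # the outlier is pulled down to the upper fence
```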

Model Building

Model Evaluation Criterion

The model can make two kinds of wrong predictions:

  1. Wrongly identifying customers as loan borrowers when they are not - False Positive
  2. Wrongly identifying customers as non-borrowers when they would actually take the loan - False Negative

How to reduce losses

Creating a Confusion Matrix
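A minimal sketch with toy labels (not the project's data) showing how the two error types land in sklearn's confusion matrix, and how recall penalises the False Negatives we care about:

```python
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels standing in for Personal_Loan (1 = borrower)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predictions: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                 # 3 1 1 3
print(recall_score(y_true, y_pred))   # TP / (TP + FN) = 0.75
```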

Logistic Regression (with Sklearn library)

Observations:

Logistic Regression Using statsmodels:

Observations:

Checking for Multicollinearity using VIF Scores:

Observations:

Observations:

Variable Significance:

Insights

Hence, we will use lg4 as the final model

Observations from Model:

Coefficient Interpretations:

Converting Coefficients to odds:

Odds ratio = exp(coef)

Probability = odds / (1 + odds)
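The two formulas above can be sketched directly (the coefficient value here is hypothetical, not one of lg4's fitted coefficients):

```python
import numpy as np

coef = 0.5  # hypothetical logistic-regression coefficient

odds_ratio = np.exp(coef)                     # exp(coef)
probability = odds_ratio / (1 + odds_ratio)   # odds / (1 + odds)
print(round(odds_ratio, 3), round(probability, 3))  # 1.649 0.622
```

So a one-unit increase in this predictor multiplies the odds of taking the loan by about 1.65, holding the other predictors constant.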

Observations:

Identifying Key Variables:

Confusion Matrix: lg4 Model Predictions on Test Data

Observations:

Model Performance Improvement

AUC-ROC curve:

Optimal Threshold from AUC-ROC
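One common way to pick the threshold from the ROC curve is Youden's J statistic (maximising TPR − FPR); the sketch below uses synthetic data and a fresh LogisticRegression, where the notebook would use the fitted lg4 model's predicted probabilities:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Synthetic stand-in data, not the loan dataset
X, y = make_classification(n_samples=500, random_state=0)
scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

fpr, tpr, thresholds = roc_curve(y, scores)
# Youden's J: the threshold where TPR - FPR is largest
best = thresholds[np.argmax(tpr - fpr)]
print(round(best, 3))
```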

Observations

Precision-Recall Curve
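A minimal sketch of the precision-recall trade-off on synthetic data (the notebook would pass the lg4 model's test-set probabilities instead):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_recall_curve

# Synthetic stand-in data, not the loan dataset
X, y = make_classification(n_samples=500, random_state=0)
scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y, scores)
print(round(auc(recall, precision), 3))  # area under the PR curve
```

Plotting `precision` against `recall`, the threshold near where the two curves cross is often chosen to balance False Positives against False Negatives.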

Observations

Sequential Feature Selector method:
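A minimal sketch of sklearn's `SequentialFeatureSelector` on synthetic data, scored on recall in line with the evaluation criterion above (feature counts and the grid of settings are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with a few informative columns
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)

# Forward selection keeping 3 features, cross-validated on recall
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=3,
                                direction="forward", scoring="recall", cv=5)
sfs.fit(X, y)
print(sfs.get_support())  # boolean mask of the selected columns
```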

Observations:

Model Building - Decision Tree:

Approach

  1. Data preparation
  2. Partition the data into train and test set.
  3. Build a CART model on the train data.
  4. Tune the model and prune the tree, if required.
  5. Evaluate the model on the test set.
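The steps above can be sketched end-to-end on synthetic data (split proportions and the random seeds are illustrative assumptions, not the notebook's settings):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1-2. Prepare a synthetic stand-in dataset and split it, stratified on the target
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# 3. Build a CART model on the train data
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# 5. Evaluate on the test set (tuning/pruning comes in the later sections)
r = recall_score(y_test, tree.predict(X_test))
print(round(r, 3))
```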

Split Data

Model Building

Observations:

Model Evaluation Criteria - Recall

Visualizing the Decision Tree

Observations:

Observations:

Reduce Over-Fitting:

GridSearch for Hyperparameter Tuning of the Tree Model
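A minimal sketch of `GridSearchCV` over a decision tree, scored on recall; the parameter grid here is a hypothetical example, not the notebook's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the loan dataset
X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Illustrative grid; depth and leaf-size ranges are assumptions
param_grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid,
                    scoring="recall", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)  # the combination with the best cross-validated recall
```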

Observations:

Visualizing the Tree

Observations:

Cost Complexity Pruning

Finding the ccp_alpha values

Let's plot the Recall vs. Alpha values for both the Train and Test sets

The maximum Recall value is at alpha = 0.0042, but at this alpha the decision tree would have very few nodes and we would lose valuable business information.

Hence we will use the point where the Recall value first begins to drop, at alpha = 0.003. This ensures we retain information while still getting a high recall value.
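The alpha candidates come from sklearn's cost-complexity pruning path; a minimal sketch on synthetic data (the notebook then scans these alphas for the Recall-vs-alpha plot above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the loan dataset
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Effective alphas at which subtrees get pruned away
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train)
ccp_alphas = path.ccp_alphas

# Refit one tree per alpha; larger alpha prunes more aggressively
trees = [DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X_train, y_train)
         for a in ccp_alphas]
print(trees[0].get_n_leaves(), trees[-1].get_n_leaves())
```

The largest alpha collapses the tree to a single node, which is why blindly maximising recall at a high alpha can discard nearly all of the tree's structure.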

Observations:

Observations:

Visualizing Decision Tree for best_model2

Comparison of all Models for Personal_Loan prediction

Conclusion:

Model Misclassifications:

Analysing predictions that were off the mark

Observations: